(54) Extraction of information from structured documents



(57) A method of extracting information from a struc- 
tured document includes the steps of assigning a partial 
tree identifier inclusive of a tag identifier to a selected 
partial tree wherein the tag identifier includes a name of 
a tag corresponding to a root of the selected partial tree, 
a name of at least one format attribute of the tag, and a 
value of the at least one format attribute, arranging 



names of format attributes in a predetermined order in 
the tag identifier if the at least one format attribute of the 
tag includes two or more format attributes, and identify- 
ing a partial tree having a partial tree identifier identical 
to the partial tree identifier of the selected partial tree 
from a list of partial tree identifiers of partial trees that 
exist in the structured document after updating thereof. 
Description

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

[0001] The present invention generally relates to a 
method of extracting information from structured docu- 
ments such as HTML documents or the like, and partic- 
ularly relates to an information extraction method that 
identifies and extracts a desired text portion selected in 
advance from daily updated structured documents. Fur- 
ther, the present invention relates to a user interface by 
which a desired portion can readily be selected in a 
structured document. 

2. Description of the Related Art 

[0002] There are needs for a means to select a par- 
ticular portion from a structured document such as an 
HTML (hyper text markup language) document or the 
like that is updated daily. For example, a user may wish
to select portions of particular interest from Web pages 
that the user is familiar with, putting these portions to- 
gether to create a collection of information which allows 
the user to readily view only necessary information. 
When the source of collected information is daily updat- 
ed, the selected portion needs to be identified again and 
again in the daily updated document for use in the col- 
lection. 

[0003] Japanese Patent No. 2867986 directed to a 
WWW information extraction system teaches storing in- 
formation indicative of a start point and an end point of 
a portion selected in advance. Based on this informa- 
tion, the start point and the end point are identified in 
the updated document, followed by extracting the por- 
tion existing between these two points as the selected 
portion. For example, texts corresponding to the start 
point and the end point, respectively, of the selected por- 
tion are stored in memory. When extracting the selected 
portion from the document, the stored texts are used to 
identify the start point and the end point in the HTML 
document, followed by extracting the identified portion. 
[0004] A system proposed by webMethods corporation
(http://www.w3.org/TR/NOTE-widl) and a system proposed by
Luca Iocchi (Luca Iocchi: The Web-OEM approach to Web
information extraction, Journal of Network and Computer
Applications, Vol.22, pp. 259-269 (1999)) approach this
issue by converting an HTML document into a tree structure,
storing information about a partial tree corresponding to a
portion selected in advance, and identifying a portion of
the updated document that corresponds to the stored partial
tree. Here, the information about a partial tree consists of
a character string serving as an identifier of the selected
portion. A tag name is used as a tag identifier, and tag
names at the same hierarchical level in the tree structure
are provided with respective numerical value indexes. The
tag names paired with the numerical value indexes are
connected in series to make the character string, which
represents the structure from the root of the whole tree to
the root of the partial tree corresponding to the selected
portion. In the example of Fig.1, "doc" is regarded as the
root of the whole tree, and the identifier that points to
the selected portion "local news" is represented as
"doc.table[0].table[0]".
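The index-based identifier of the related art can be illustrated with a minimal Python sketch. The names "Node" and "index_paths" are hypothetical and introduced here for illustration only; the cited systems are not known to be implemented this way. The sketch assumes that sibling tags with the same name are numbered from 0 and that tag names and indexes are joined with periods, as in the example "doc.table[0].table[0]".

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tag in the document tree (hypothetical structure for illustration)."""
    tag: str
    children: list = field(default_factory=list)


def index_paths(root):
    """Map each index-based path, e.g. 'doc.table[0].table[0]', to its node."""
    paths = {root.tag: root}

    def walk(node, prefix):
        counts = {}  # number of siblings seen so far, per tag name
        for child in node.children:
            index = counts.get(child.tag, 0)
            counts[child.tag] = index + 1
            path = f"{prefix}.{child.tag}[{index}]"
            paths[path] = child
            walk(child, path)

    walk(root, root.tag)
    return paths
```

Under these assumptions, inserting another table ahead of the selected one at the same level moves the selected table from "doc.table[0].table[0]" to "doc.table[0].table[1]", so the stored path then points at the wrong portion.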
[0005] In the related-art method disclosed in Japanese
Patent No. 2867986 regarding the WWW information extraction
system, a selected portion is extracted based on the
information indicative of the start point and end point of
the selected portion. It naturally follows that such
information needs to be an item that always remains intact
in the document after updating. It is difficult, however, to
identify enduring information that is unchanged through
updating. Many exceptions exist on homepages on the Internet,
as designs of such homepages tend to be at the designers'
discretion, so that the method as described above may not be
applicable to a wide range of application areas.
[0006] If texts corresponding to the start and end points
are used as a clue in the WWW information extraction system,
these texts themselves may be subjected to updating as shown
in Fig.2. In such a case, this method fails.

[0007] Further, if a selected portion is extracted as shown
in Fig.3A by this method, the extracted portion does not
constitute a proper partial tree as a tree structure, an
example of which is shown in Fig.3B. Because of this,
difficulties would be encountered if an attempt is made to
make use of this extracted portion in another structured
document.

[0008] The method utilizing the identifier of a partial tree
of a selected portion as taught by webMethods corporation
or Luca Iocchi relies on the premise that the document
structure does not change through updating. If the document
structure changes even slightly through updating, the
identifier of a partial tree selected in advance will not
match an identifier after updating.

[0009] For example, a text block having the same tag as an
existing tag may be inserted into the same hierarchical level
of the tree structure to which the selected portion of the
document belongs. This results in a numerical value index of
the tag being changed in the identifier of the partial tree.
In the example of Fig.1, the document is updated by inserting
the text regarding "ADVERTISEMENT 2" bracketed in table tags
above the selected portion. As a result, the numerical value
index of the tag identifier based on the tag name "table" in
respect of the selected "local news" is changed from
"table[0]" to "table[1]". Such small format changes are
likely to be made on a site top page where banners, breaking
news, etc., are inserted and deleted constantly. Since a site
with such constant updating of information is the very kind
of site that users wish to select portions from, the
degradation of reliability of portion identification needs to
be addressed if such degradation occurs through minor
updating.

[0010] When a tag that was not in existence at the time of
the portion selection is inadvertently left open above the
selected portion, this tag appears as a parent node relative
to the selected portion. In the example of updating shown in
Fig.1, the table tag enclosing "ADVERTISEMENT 1" above the
selected portion is inadvertently left open. As a
consequence, an identifier that should correctly appear as
"doc.table[0].table[0]" becomes
"doc.table[0].table[0].table[1]", which indicates the
existence of a table tag as a parent node of the selected
portion "local news". This makes the identifier of the
partial tree fail to match between before and after updating.
WWW browsers widely used today permit open-ended tags, and
page designers often update pages without noticing the fact
that open-ended tags are present in the pages.

[0011] Insertion of a text block having the same tag and
inadvertent lack of a closing tag together cause trouble in
the example of updating of the document shown in Fig.1.
Namely, the identifier of a partial tree that points to the
selected portion is changed from "doc.table[0].table[0]" to
"doc.table[0].table[0].table[1]".
[0012] The methods proposed by webMethods corporation and
Iocchi further have a problem in that knowledge of tags and
document structures, as well as skill, is necessary when
selecting a portion in a structured document such as an HTML
document.

SUMMARY OF THE INVENTION

[0013] It is a general object of the present invention 
to substantially obviate one or more problems caused 
by the limitations and disadvantages of the related art. 
[0014] It is another and more specific object of the
present invention to provide a method of extracting in- 
formation from a structured document that can extract 
a selected portion without having reliability degraded 
through updating of the document. 
[0015] It is still another object of the present invention
to provide a method of selecting and extracting a portion
from a structured document by which the user can select
the portion of the structured document such as an HTML
document in a manner that is intuitively easy to understand.
[0016] According to the invention, a tag identifier is
comprised of a name of a tag, a name of at least one format
attribute of the tag, and a value of the at least one format
attribute, and is used as a partial tree identifier. With
this partial tree identifier, the reliability of portion
extraction is not degraded because the start and end points
are not relied upon. Even if a text block having the same tag
as the partial tree of a selected portion is inserted into
the same hierarchical level where the selected portion
belongs, it suffices that the inserted tag has a different
format attribute.

[0017] Further, numerical value indexes are generated that
indicate the sequence numbers of tag identifiers belonging to
the same hierarchical level of the tree structure. A tag
identifier and a numerical value index are paired as a set,
and a plurality of sets are connected in series from the root
of the whole tree structure to the root of a partial tree,
thereby providing the partial tree identifier. With this
provision, it is possible to uniquely identify the selected
portion even if the same combination of a tag and format
attributes that corresponds to the root of the selected
partial tree is used for other tags in the document.

[0018] If there are two or more matching partial trees 
at the time of identifying a partial tree, the matching of 
identifiers is recursively performed by successively as- 
cending to a next higher parent node. This makes it pos- 
sible to avoid the degradation of the reliability of portion 
extraction even if there is a tag that is inadvertently left 
open above the selected portion. 
[0019] According to another aspect of the present invention,
the system for selecting and extracting a portion
of a structured document such as an HTML document 
detects an end node of a tree structure that corresponds 
to a position indicated by a user on the screen displaying 
the structured document. A series of ancestor nodes are 
successively obtained for visual presentation on the 
screen, and the user is prompted to select a node. This 
allows the user to easily select a portion of the structured 
document according to node selection, so that the se- 
lected portion will be readily reused in another struc- 
tured document. 
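The ancestor lookup behind this user interface can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name "ancestor_chain" and the representation of the tree as a mapping from each node to its parent are hypothetical and do not come from the specification.

```python
def ancestor_chain(parents, end_node):
    """Return [end_node, parent, grandparent, ...] up to the root.

    'parents' maps each node to its parent; the root maps to None
    (an assumed representation chosen for this sketch).
    """
    chain = [end_node]
    while parents.get(chain[-1]) is not None:
        chain.append(parents[chain[-1]])
    return chain
```

Each entry of the returned chain is a candidate node that could be presented on the screen so that the user can choose how large a portion to select.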

[0020] Other objects and further features of the 
present invention will be apparent from the following de- 
tailed description when read in conjunction with the ac- 
companying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

[0021] 

Fig.1 is an illustrative drawing for explaining the re- 
lated art; 

Fig.2 is an illustrative drawing showing an example 
in which a portion of a document is selected and 
extracted by using the texts indicative of start and 
end points; 

Figs.3A and 3B are illustrative drawings showing an 
example in which a portion of an HTML document 
is extracted by using the texts indicative of start and 
end points; 

Fig.4 is a flowchart showing a schematic of the 
present invention; 

Fig.5 is an illustrative drawing for explaining a case
in which a plurality of tags have the same tag name and
format attributes;

Fig.6 is a block diagram of a system for extracting
information from a structured document according 
to a first embodiment of the present invention; 
Fig.7 is an illustrative drawing showing an example of a
displayed page for portion selection according to the first
embodiment of the present invention;
Fig.8 is a drawing showing an example of information stored
in the portion-information storage unit according to the
first embodiment of the present invention;
Fig.9 is a drawing showing an example of tree-structure data
generated by the document structure analysis of the first
embodiment of the present invention;
Fig.10 is a drawing showing the contents of the
portion-information storage unit according to the first
embodiment of the present invention;
Fig.11 is a drawing showing an example of an element list
according to the first embodiment of the present invention;
Fig.12 is a drawing showing an example of a converted
tree-data structure according to the first embodiment of the
present invention;
Fig.13 is a flowchart of a method of extracting information
from a structured document according to the first embodiment
of the present invention;
Fig.14 is a block diagram of a system for extracting
information from a structured document according to a second
embodiment of the present invention;
Fig.15 is an illustrative drawing showing the generation of
a partial tree identifier of a selected portion according to
a second embodiment of the present invention;
Fig.16 is an illustrative drawing showing the generation of
partial tree data according to a second embodiment of the
present invention;
Fig.17 is a flowchart of a method of extracting information
from a structured document according to the second
embodiment of the present invention;
Fig.18 is a block diagram of a system for extracting
information from a structured document according to a third
embodiment of the present invention;
Fig.19 is a flowchart of a method of extracting information
from a structured document according to the third embodiment
of the present invention;
Fig.20 is a flowchart showing a schematic of a user
interface of the present invention;
Fig.21 is a block diagram of a schematic user interface
according to the present invention;
Fig.22 is a block diagram of an apparatus according to an
embodiment of the present invention;
Fig.23 is a flowchart of a method of selecting and
extracting a portion according to an embodiment of the
present invention;
Fig.24 is an illustrative drawing showing an example of a
portion selection on a browser according to the embodiment
of the present invention;
Fig.25 is a flowchart showing an operation of a
tree-structure generating unit according to the embodiment
of the present invention;
Fig.26 is a flowchart of an operation of the selected
portion marking unit according to the embodiment of the
present invention;



Fig.27 is an illustrative drawing showing an exam- 
ple of a tree structure and the associated presenta- 
tion of selected portions according to the embodi- 
ment of the present invention; 
Fig.28 is an illustrative drawing showing a construc- 
tion of the system according to an embodiment of 
the present invention; and 
Fig.29 is an illustrative drawing showing an exam- 
ple of an HTML source, an associated tree struc- 
ture, and associated browser presentation. 

DESCRIPTION OF THE PREFERRED 
EMBODIMENTS 

[0022] In the following, embodiments of the present 
invention will be described with reference to the accom- 
panying drawings. 

[0023] Fig.4 is a flowchart showing a schematic of the
present invention. 

[0024] A method of extracting information from a 
structured document according to the present invention 
converts a document into a tree structure, and gener- 
ates an identifier of a partial tree corresponding to a por- 
tion of the document, thereby specifying any desired 
portion of the structured document in advance and pro- 
viding a basis for subsequently identifying the selected 
portion from the updated document.
[0025] As shown in Fig.4, this method uses a tag iden- 
tifier as an identifier of a partial tree where the tag iden- 
tifier is comprised of a tag name corresponding to the 
root of the partial tree, names of one or more format at- 
tributes of the tag, and the values of the format attributes 
(step 1). If there are a plurality of format attributes for 
the tag identifier, the format attributes are arranged in a 
predetermined order (e.g., alphabetical order) of the for- 
mat attribute names to normalize the tag identifier (step 
2). A partial tree having the same identifier as the al- 
ready selected partial tree is identified as the selected 
portion from the list of identifiers of partial trees that exist 
in the document converted into a tree structure (step 3). 
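Steps 1 and 2 can be sketched in Python as follows. The function name "make_tag_identifier" is hypothetical. The output format "table_border=0&cellpadding=1" follows the example given later in this description, while the lower-casing of names and the handling of a tag without format attributes are assumptions of this sketch.

```python
def make_tag_identifier(tag_name, format_attributes):
    """Build a normalized tag identifier such as 'table_border=0&cellpadding=1'.

    Attribute names are sorted alphabetically (step 2) so that the same
    tag always yields the same identifier regardless of attribute order.
    Lower-casing names is an assumption, not taken from the description.
    """
    pairs = sorted((name.lower(), value) for name, value in format_attributes.items())
    if not pairs:
        # Assumption: a tag without format attributes is identified by name only.
        return tag_name.lower()
    return tag_name.lower() + "_" + "&".join(f"{name}={value}" for name, value in pairs)
```

For example, calling the function with the tag name "table" and the attributes cellpadding="1" and border="0", given in that order, yields "table_border=0&cellpadding=1".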
[0026] If the same combination of a tag name and for- 
mat attributes that represent the root of the selected par- 
tial tree is used for two or more tags in the document as 
shown in Fig.5, numerical value indexes are generated
that indicate the sequence numbers of tag identifiers be- 
longing to the same hierarchical level of the tree struc- 
ture. A tag identifier and a numerical value index are 
paired as a set, and a plurality of sets are connected in 
series from the root of the whole tree structure to the 
root of the selected partial tree, thereby providing the 
identifier of the partial tree. 
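The pairing of tag identifiers with numerical value indexes can be sketched as below. All names ("Node", "tag_identifier", "partial_tree_identifiers", "render") are hypothetical, and the textual rendering with "[index]" and periods is an assumption modeled on the related-art notation, since the exact format of the figures is not reproduced in this text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def tag_identifier(node):
    """Tag name plus format attributes sorted by attribute name."""
    parts = [f"{name}={value}" for name, value in sorted(node.attrs.items())]
    return node.tag + ("_" + "&".join(parts) if parts else "")


def partial_tree_identifiers(root):
    """Return (path, node) pairs; a path is a tuple of (tag identifier, index) sets."""
    result = []

    def walk(node, path):
        result.append((path, node))
        seen = {}  # sequence numbers per tag identifier at this level
        for child in node.children:
            ident = tag_identifier(child)
            index = seen.get(ident, 0)
            seen[ident] = index + 1
            walk(child, path + ((ident, index),))

    walk(root, ((tag_identifier(root), 0),))
    return result


def render(path):
    """Join the sets from the whole-tree root down to the partial-tree root."""
    return ".".join(f"{ident}[{index}]" for ident, index in path)
```

Because the index counts only siblings that share the same tag identifier, inserting a table with different format attributes at the same level leaves the identifier of the selected table unchanged in this sketch.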

[0027] There is then a need to identify the partial tree 
having the same identifier as the already selected partial 
tree from a list of identifiers of partial trees that are 
present in the document converted into a tree structure. 
Matching of partial tree identifiers is performed by taking 
into consideration only the tag identifier of the root of the 
selected partial tree. If there are two or more partial trees 
that match the selected partial tree, then the numerical
value index associated with the tag identifier is matched 
to screen the candidates. If more than one candidate 
still remains after the screening of candidates based on 
the utilization of the numerical value index, a parent- 
node tag is then taken into consideration for matching 
of identifiers. The matching of identifiers is recursively 
performed by ascending to successive ancestor nodes 
until only one partial tree remains as a candidate. This 
remaining tree is identified as the selected partial tree. 
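The screening order described above can be sketched in Python as follows. The function name "identify" and the representation of an identifier as a tuple of (tag identifier, index) sets ordered from the whole-tree root to the partial-tree root are assumptions of this sketch. The description does not state what happens when screening removes every candidate or when ambiguity remains at the root; returning None in both cases is likewise an assumption.

```python
def identify(stored, candidates):
    """Pick the candidate identifier that matches the stored one.

    Matching starts at the root of the selected partial tree (the last
    set), first on the tag identifier and then on the numerical value
    index, and ascends one ancestor at a time while more than one
    candidate remains.
    """
    remaining = list(candidates)
    for depth in range(1, len(stored) + 1):
        ident, index = stored[-depth]
        # Position 0 is the tag identifier, position 1 the numerical index.
        for position, wanted in ((0, ident), (1, index)):
            remaining = [
                path for path in remaining
                if len(path) >= depth and path[-depth][position] == wanted
            ]
            if len(remaining) <= 1:
                return remaining[0] if remaining else None
    return None  # assumption: still ambiguous after reaching the root
```

In this sketch a stray parent tag above the selected portion does no harm as long as the tag identifier of the selected root is unique, because the ancestors are never consulted in that case.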

[FIRST EMBODIMENT] 

[0028] Fig.6 is a block diagram of a system for extract- 
ing information from a structured document according 
to the first embodiment of the present invention. 
[0029] In the system as shown, the reliability of portion
extraction is not degraded, since the system operates based
on a method independent of the start and end positions of a
selected portion. Namely, even if a text block having the
same tag as the partial tree of the selected portion is
inserted into the same hierarchical level where the selected
portion belongs, it suffices that its tag has different
format attributes.

[0030] The system of Fig.6 includes a portion select- 
ing unit 1 for receiving instruction from a user that se- 
lects a portion in a structured document, a portion-infor- 
mation storage unit 2 for storing information about the 
selected portion, a document-structure analyzing unit 3 
which identifies a partial tree in the tree structure by use 
of tags and associated format attributes, and a portion 
identifying unit 4 for returning a document portion cor- 
responding to the selected portion upon user request. 
[0031] The portion selecting unit 1 includes a document
retrieving unit 11, a portion specifying unit 12, and a
document structuring unit 13.
[0032] The document retrieving unit 11 receives a request for
document retrieval from the portion specifying unit 12 where
the request specifies a URL (uniform resource locator)
serving as an identifier of a document. The document
retrieving unit 11 then retrieves the requested document, and
gives it to the portion specifying unit 12.

[0033] The portion specifying unit 12 sends to the document
retrieving unit 11 a request for document retrieval with a
URL, and obtains the document. The portion specifying unit 12
then requests the document structuring unit 13 to structure
the document, and obtains the document converted into a tree
structure. As shown in Fig.7, the portion specifying unit 12
provides a user interface that helps the user to specify a
portion in the document. An identifier of a partial tree is
generated according to the coordinates or the like of the
specified portion. This identifier together with the URL is
stored in the portion-information storage unit 2 as shown in
Fig.8.
[0034] The document structuring unit 13 requests the
document-structure analyzing unit 3 to structure the document
that is received from the portion specifying unit 12. The
document structuring unit 13 then receives the document
converted into a tree structure as a data structure
representing parent-child relations in the tree structure as
shown in Fig.9. For example, tags and text elements
constituting the tree structure are represented by an object
ID, a label, a child-node list, and a partial tree
identifier. A list of these items is received as the data
structure.

[0035] The portion-information storage unit 2 receives the
URL and the partial tree identifier from the portion
specifying unit 12, and assigns a document portion ID for
identifying the set of the URL and the partial tree
identifier. This set and the assigned document portion ID are
stored as shown in Fig.10. The document portion ID is then
returned to the portion specifying unit 12.
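The behavior of the portion-information storage unit 2 can be sketched as a small in-memory store. The class name "PortionInformationStore", its method names, and the use of sequential integers as document portion IDs are assumptions for illustration; the description does not specify the ID format or the storage medium.

```python
import itertools


class PortionInformationStore:
    """Keeps (URL, partial tree identifier) sets keyed by a document portion ID."""

    def __init__(self):
        self._records = {}
        self._next_id = itertools.count(1)  # assumption: sequential integer IDs

    def register(self, url, partial_tree_identifier):
        """Store the set and return the document portion ID assigned to it."""
        portion_id = next(self._next_id)
        self._records[portion_id] = (url, partial_tree_identifier)
        return portion_id

    def lookup(self, portion_id):
        """Return the (URL, partial tree identifier) set for a portion ID."""
        return self._records[portion_id]
```

The ID returned by "register" corresponds to the document portion ID handed back to the portion specifying unit 12, and "lookup" corresponds to the retrieval later performed for the portion identifying unit 4.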
[0036] The document-structure analyzing unit 3 in- 
cludes a tree-structure conversion unit 31 and a partial- 
tree-identifier generating unit 32. 
[0037] The tree-structure conversion unit 31 receives a
document structuring request together with the structured
document from the document structuring unit 13 or 43. The
tree-structure conversion unit 31 converts the received
document into a tree structure having tags and texts as
document elements, and sends the converted document to the
partial-tree-identifier generating unit 32.
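A conversion of this kind can be sketched with the HTMLParser class of the Python standard library. The class "TreeBuilder", the function "to_tree", the dictionary keys, the artificial root labeled "doc", and the short list of void tags are all assumptions of this sketch, not details taken from the specification. A tag that is never closed simply stays open, so later elements become its descendants, which mirrors the open-tag situation discussed in the related art.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never have children


class TreeBuilder(HTMLParser):
    """Collects tags and texts into a flat element list with child-node lists."""

    def __init__(self):
        super().__init__()
        self.elements = [{"id": 0, "label": "doc", "attrs": {}, "children": []}]
        self._open = [0]  # stack of object IDs of the currently open tags

    def _add(self, label, attrs):
        element = {"id": len(self.elements), "label": label,
                   "attrs": attrs, "children": []}
        self.elements[self._open[-1]]["children"].append(element["id"])
        self.elements.append(element)
        return element["id"]

    def handle_starttag(self, tag, attrs):
        object_id = self._add(tag, dict(attrs))
        if tag not in VOID_TAGS:
            self._open.append(object_id)

    def handle_endtag(self, tag):
        # Close the nearest open tag with this name; ignore stray closing tags.
        for depth in range(len(self._open) - 1, 0, -1):
            if self.elements[self._open[depth]]["label"] == tag:
                del self._open[depth:]
                break

    def handle_data(self, data):
        if data.strip():
            self._add(data.strip(), {})


def to_tree(html):
    """Return the element list for an HTML string."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.elements
```

Each element carries an object ID, a label, and a child-node list; a partial tree identifier could be attached to each tag element in a later pass.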

[0038] The partial-tree-identifier generating unit 32
generates a tag identifier for each tag constituting the
document that is converted into the tree structure by the
tree-structure conversion unit 31. The tag identifier is
comprised of a tag name, a name of a format attribute, and a
value of the format attribute. In an example of Fig.9, the
first "table" tag is given a tag identifier
"table_border=0&cellpadding=1", which combines a tag name
"table" and format attributes and their values 'border="0"
cellpadding="1"'. If there are two or more format attributes,
they are arranged in a predetermined order of the format
attribute names to normalize the tag identifier. The tag
identifier obtained in this manner is used as an identifier
of a partial tree that has this tag as its root, and is
matched with a corresponding tree-structure element.
Tree-structure data inclusive of partial tree identifiers as
shown in Fig.9 is then sent to the document structuring unit
13 or 43.

[0039] The portion identifying unit 4 includes a document
retrieval unit 41, a partial-tree-identifier identifying
unit 42, and the document structuring unit 43.
[0040] The document retrieval unit 41 receives a document
retrieval request together with a URL serving as a document
identifier from the partial-tree-identifier identifying unit
42. Upon receipt of the request, the document retrieval unit
41 obtains the document from the Internet, and returns the
document to the partial-tree-identifier identifying unit 42.

[0041] The partial-tree-identifier identifying unit 42
receives a portion retrieval request together with the
document portion ID from the user, and transfers the document
portion ID to the portion-information storage unit 2 to
obtain the relevant URL and the corresponding partial tree
identifier. The partial-tree-identifier identifying unit 42
supplies the URL to the document retrieval unit 41 to obtain
the corresponding document. A request is then sent to the
document structuring unit 43 for structuring of the obtained
document, and a list of elements of the converted tree
structure as shown in Fig.11 is obtained in response. The
partial-tree-identifier identifying unit 42 extracts a tag
from the obtained list of elements by finding the tag that
corresponds to the partial tree identifier. The
partial-tree-identifier identifying unit 42 then provides the
user with a document portion corresponding to the partial
tree belonging to the extracted tag.

[0042] The document structuring unit 43 requests the 
document-structure analyzing unit 3 to structure the 
document that is received from the partial-tree-identifier 
identifying unit 42. The document structuring unit 43 
then receives the document converted into a tree struc- 
ture as a data structure representing parent-child rela- 
tions in the tree structure as shown in Fig. 12. For exam- 
ple, tags and text elements constituting the tree struc- 
ture are represented by an object ID, a label, a child- 
node list, and a partial tree identifier. A list of these items 
is received as the data structure. 
[0043] In the following, an operation of the system will
be described.

[0044] Fig.13 is a flowchart of a method of extracting
information from a structured document according to the 
first embodiment of the present invention. 
[0045] This operation includes a portion selection 
process A, a portion identification process B, and a doc- 
ument structuring process M. In the following, steps will 
be described with one of the three designations A, B, 
and M. 

[0046] First, the portion selection process A will be de- 
scribed. 

[0047] At step A10, the portion specifying unit 12 re- 
sponds to a user instruction with an associated URL by 
having the document retrieving unit 11 obtain a docu- 
ment corresponding to the URL from the Internet. The 
portion specifying unit 12 sends the received document 
to the document structuring unit 13 for structuring of the
document. The procedure goes to step M10. 
[0048] At step M10, the tree-structure conversion unit 31
receives the structured document from the document
structuring unit 13, and converts the document into a tree
structure having tags and texts as document elements, which
is supplied to the partial-tree-identifier generating unit
32. The procedure then goes to step M20.

[0049] At step M20, the partial-tree-identifier generating
unit 32 generates a tag identifier for each tag constituting
the document that is converted into the tree structure by the
tree-structure conversion unit 31. The tag identifier is
comprised of a tag name, a name of a format attribute, and a
value of the format attribute. In an example of Fig.9, the
"table" tag is given a tag identifier
"table_border=0&cellpadding=1", which combines a tag name
"table" and format attributes and their values 'border="0"
cellpadding="1"'. If there are two or more format attributes,
they are arranged in a predetermined order of the format
attribute names to normalize the tag identifier. The tag
identifier obtained in this manner is used as an identifier
of a partial tree, and is matched with a corresponding
tree-structure element. Tree-structure data inclusive of
partial tree identifiers as shown in Fig.9 is then sent to
the document structuring unit 13.

[0050] At step A20, the portion specifying unit 12 isolates
a portion selected by the user through a user interface that
provides the user with a means of easy selection as shown in
Fig.7. The procedure then goes to step A30.

[0051] At step A30, the portion specifying unit 12 obtains a
partial tree identifier corresponding to the selected portion
from the coordinates or the like of a selected area as shown
in Fig.8. The obtained partial tree identifier and the
document URL are stored as a pair in the portion-information
storage unit 2, and the document portion ID corresponding to
the stored pair is acquired.
[0052] In what follows, the portion identification process B
will be described.

[0053] At step B10, the partial-tree-identifier identifying
unit 42 receives a portion retrieval request together with a
document portion ID from the user. The
partial-tree-identifier identifying unit 42 transfers the
document portion ID to the portion-information storage unit 2
to obtain the relevant URL and the corresponding partial tree
identifier. The procedure then goes to step B20.
[0054] At step B20, the partial-tree-identifier identifying
unit 42 obtains a document corresponding to the obtained URL
by using the document retrieval unit 41. The
partial-tree-identifier identifying unit 42 passes the
obtained document to the document structuring unit 43, and
issues a document structuring request. The procedure proceeds
to step M10.

[0055] At step M10, the tree-structure conversion unit 31
receives the structured document from the document
structuring unit 43, and converts the document into a tree
structure inclusive of document tags and texts. The
tree-structure conversion unit 31 supplies the tree structure
to the partial-tree-identifier generating unit 32. The
procedure proceeds to step M20.
[0056] At step M20, the partial-tree-identifier generat- 
ing unit 32 generates a tag identifier for each tag con- 
stituting the document that is converted into the tree 

so structure by the tree-structure conversion unit 31 . Trie 
tag rdentffTer Is comprised of a tag name, a name of a 
format attribute, and a value of the format attribute. In 
an example of Fig.9, the "table" tag is given a tag iden- 
tifier "table_border=0 & cellpadding=1", which com- 

55 bines a tag name "table" and format attributes and their 
values t>order="0" cell padding="1"\ If there are two or 
more format attributes, they are arranged in a predeter- 
mined order of the format attribute names to normalize 
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the tag identifier. The tag identifier obtained in this man- 
ner is used as an identifier of a partial tree, and is 
matched with a corresponding tree-structure element. 
Tree-structure data inclusive of partial tree identifiers as 
shown in Fig. 9 is then send to the document structuring s 
unit 43. The procedure then goes to B30. 
[0057] At step B30, the partial-tree-identifier identifying
unit 42 finds a tag corresponding to the obtained partial
tree identifier from the list of elements of the converted
tree structure as shown in Fig.11. If no corresponding
partial tree structure identifier is found, the procedure
comes to an end. If a corresponding partial tree structure
identifier is found, the procedure proceeds to step B40.
[0058] At step B40, the partial-tree-identifier identifying
unit 42 provides the user with a document portion
belonging to a partial tree that corresponds to the
obtained partial tree identifier.
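Steps B30 and B40 of the first embodiment can be sketched as follows. This is only an illustration under stated assumptions: the input is well-formed HTML in which every tag is explicitly closed, attribute names are ordered alphabetically, and the "document portion" is reduced to the text contained in the matching element. The names `IdentifierCollector` and `find_portion` are hypothetical.

```python
from html.parser import HTMLParser


class IdentifierCollector(HTMLParser):
    """Collect (tag identifier, contained text) pairs for every element."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements: (identifier, text parts)
        self.elements = []   # finished elements, in the order they were closed

    def handle_starttag(self, tag, attrs):
        ordered = sorted((name, value or "") for name, value in attrs)
        ident = tag
        if ordered:
            ident += "_" + "&".join(f"{n}={v}" for n, v in ordered)
        self.stack.append((ident, []))

    def handle_data(self, data):
        # Text belongs to every element that is open when it is encountered.
        for _, parts in self.stack:
            parts.append(data)

    def handle_endtag(self, tag):
        # Assumption: tags are properly nested, so the top of the stack matches.
        if self.stack:
            ident, parts = self.stack.pop()
            self.elements.append((ident, "".join(parts).strip()))


def find_portion(html, identifier):
    """Return the text of the first element whose identifier matches, or None."""
    collector = IdentifierCollector()
    collector.feed(html)
    collector.close()
    for ident, text in collector.elements:
        if ident == identifier:
            return text      # step B40: provide the document portion
    return None              # step B30: no corresponding identifier found


page = (
    '<html><body>'
    '<table border="1" cellpadding="1"><tr><td>menu</td></tr></table>'
    '<table cellpadding="1" border="0"><tr><td>news</td></tr></table>'
    '</body></html>'
)
print(find_portion(page, "table_border=0&cellpadding=1"))
```

The second table is found even though its attributes are written in the opposite order, because the identifier is normalized before comparison.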

[SECOND EMBODIMENT] 

[0059] Fig. 14 is a block diagram of a system for ex- 
tracting information from a structured document accord- 
ing to a second embodiment of the present invention. 
[0060] The system as shown is directed to a configuration
that can uniquely identify a selected portion even if the
same combination of a tag and format attributes that
corresponds to the root of a selected partial tree is
used for other tags in the document.
[0061] The configuration of the second embodiment is
identical to that of the first embodiment, except for a
partial-tree-identifier generating unit 32a of the
document-structure analyzing unit 3. A description of the
identical portion will be omitted in the following.
[0062] The partial-tree-identifier generating unit 32a
generates a tag identifier for each tag constituting the
document that is converted into the tree structure by the
tree-structure conversion unit 31. The tag identifier is
comprised of a tag name, a name of a format attribute,
and a value of the format attribute. In an example of
Fig. 15, the first "table" tag is given a tag identifier
"table_border=0&cellpadding=1", which combines a tag
name "table" and format attributes and their values
'border="0" cellpadding="1"'. If there are two or more
format attributes, they are arranged in a predetermined
order of the format attribute names to normalize the tag
identifier.

[0063] Numerical value indexes are then generated that
indicate the sequence numbers of tag identifiers belonging
to the same hierarchical level of the tree structure. A
tag identifier and a numerical value index are paired as
a set, and a plurality of sets are connected in series
from the root of the whole tree structure to the root of
a partial tree, thereby providing the identifier of the
partial tree as shown in Fig. 15. The tree-structure data
inclusive of partial tree identifiers as shown in Fig. 16
is then supplied to the document structuring unit 13 or
43.

[0064] Fig. 17 is a flowchart of a method of extracting
information from a structured document according to the
second embodiment of the present invention.
[0065] In the following, a description will be omitted in
respect of steps other than steps M20a and M30a as these
steps are identical to those of the first embodiment.
[0066] At step M20a, the partial-tree-identifier generating
unit 32a generates a tag identifier for each tag
constituting the document that is converted into the tree
structure by the tree-structure conversion unit 31. The
tag identifier is comprised of a tag name, a name of a
format attribute, and a value of the format attribute. In
the example of Fig.15, the first "table" tag is given a tag
identifier "table_border=0&cellpadding=1", which combines
a tag name "table" and format attributes and their
values 'border="0" cellpadding="1"'. If there are two or
more format attributes, they are arranged in a
predetermined order of the format attribute names to
normalize the tag identifier. The procedure then goes to
step M30a.
[0067] At step M30a, the partial-tree-identifier generating
unit 32a generates numerical value indexes that indicate
the sequence numbers of tag identifiers belonging to the
same hierarchical level of the tree structure, and
combines each tag identifier with a corresponding
numerical value index as a set. A plurality of sets are
connected in series from the root of the whole tree
structure to the root of a partial tree, thereby providing
the identifier of the partial tree as shown in Fig.15. The
tree-structure data inclusive of partial tree identifiers
as shown in Fig.16 is then supplied to the document
structuring unit 13 or 43. The procedure then proceeds to
step A20 or step B30.
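The construction of partial tree identifiers in step M30a can be sketched as below. The specification does not spell out exactly how the numerical value index is counted, so this sketch assumes a zero-based count of sibling tags that carry the same tag identifier under the same parent, and it assumes the root is written as "doc" without an index, as in the example string of the specification. The `Node` class and the function name are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag_id: str                               # normalized tag identifier
    children: list = field(default_factory=list)


def partial_tree_identifiers(root):
    """Return the partial tree identifier of every node below the root."""
    paths = []

    def walk(node, prefix):
        seen = {}  # per-parent counter for each tag identifier (assumption)
        for child in node.children:
            index = seen.get(child.tag_id, 0)
            seen[child.tag_id] = index + 1
            # One set = tag identifier plus numerical value index.
            path = f"{prefix}.{child.tag_id}[{index}]"
            paths.append(path)
            walk(child, path)

    walk(root, root.tag_id)
    return paths


inner = "table_border=0&cellpadding=1"
outer = "table_border=1&cellpadding=1"
doc = Node("doc", [Node(outer, [Node(inner), Node(inner)])])
for path in partial_tree_identifiers(doc):
    print(path)
```

Under these assumptions the second inner table receives the identifier "doc.table_border=1&cellpadding=1[0].table_border=0&cellpadding=1[1]", so two tags with the same name and format attributes remain distinguishable.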

[THIRD EMBODIMENT] 

[0068] Fig. 18 is a block diagram of a system for ex- 
tracting information from a structured document accord- 
ing to a third embodiment of the present invention. 
[0069] The system as shown is directed to a configu- 
ration that can avoid the degradation of reliability of por- 
tion extraction even if an open-ended tag exists above 
the selected portion. 

[0070] The configuration of the third embodiment is
identical to that of the first embodiment, except for a
partial-tree-identifier identifying unit 42a. A description
of the identical portion will be omitted in the following.
[0071] The partial-tree-identifier identifying unit 42a
receives a portion retrieval request with a document
portion ID from a user, and passes the document portion
ID to the portion-information storage unit 2 to obtain the
relevant URL and the corresponding partial tree
identifier. The URL is then transferred to the document
retrieval unit 41 to obtain the corresponding document. The
partial-tree-identifier identifying unit 42a sends a
request to the document structuring unit 43 to structure
the received document, thereby obtaining a list of
elements of the converted tree structure as shown in Fig.
16.

[0072] The partial tree identifier obtained from the
portion-information storage unit 2 needs to be identified
from a list of partial tree identifiers of the obtained
elements. A tag identifier located at the end of the
identifier is used alone for the matching purpose. In the
case of
"doc.table_border=1&cellpadding=1[0].table_border=0&cellpadding=1[1]",
for example, a tag identifier at the end of the partial
tree identifier refers to
"table_border=0&cellpadding=1[1]" provided at the end
of the string. When there are two or more candidates
that match the selected partial tree, the numerical value
indexes associated with the tag identifiers are referred
to in order to screen the candidates.
[0073] If more than one candidate still remains after
the screening of candidates based on the utilization of
the numerical value index, a parent-node tag is then
taken into consideration for matching of identifiers. The
matching of identifiers is recursively performed by
ascending to successive ancestor nodes until only one
partial tree remains as a candidate. This remaining tree
is identified as the selected partial tree. The user is
provided with a document portion belonging to the partial
tree that corresponds to the identified partial tree
identifier.

[0074] Fig. 19 is a flowchart of a method of extracting
information from a structured document according to the
third embodiment of the present invention. In the
following, a description will be omitted in respect of
steps other than steps B30a through B90a to avoid a
duplicate description identical to that of the first
embodiment.
[0075] At step B30a, the partial-tree-identifier identifying
unit 42a needs to identify the partial tree identifier
obtained from the portion-information storage unit 2 from
the list of elements of the converted tree structure as
shown in Fig.16. To this end, the partial-tree-identifier
identifying unit 42a chooses a tag identifier at the end of
the identifier for use as a matching element. In the case
of
"doc.table_border=1&cellpadding=1[0].table_border=0&cellpadding=1[1]",
for example, a tag identifier at the end of the partial
tree identifier refers to
"table_border=0&cellpadding=1[1]" provided at the
end of the string. After this tag identifier is chosen, the
procedure goes to step B40a.

[0076] At step B40a, the matching of tag identifiers is
performed with respect to the currently chosen tag
identifier. If there are two or more candidates that match
the obtained partial tree identifier, the procedure goes
to step B50a. If there is only one candidate, the
procedure goes to step B60a. If there is no candidate,
the procedure comes to an end.
[0077] At step B50a, the screening of the candidates 
is performed by referring to the numerical value indexes 
associated with the tag identifier. If two or more candi- 
dates still remain after screening, the procedure pro- 
ceeds to step B80a. Alternatively, if only one candidate 
remains, the procedure goes to step B60a. If there is no 
candidate, the procedure comes to an end. 
[0078] At step B60a, since there is only one candidate
that matches the obtained partial tree identifier, this
candidate partial tree is identified as the selected
partial tree, followed by proceeding to step B70a.
[0079] At step B70a, the partial-tree-identifier
identifying unit 42a provides the user with a document
portion belonging to the partial tree that corresponds to
the obtained partial tree identifier.

[0080] At step B80a, since two or more candidates remain
even after screening based on the utilization of
numerical value indexes, a next matching element is
chosen by ascending to the higher level. Namely, if the tag
identifier "table_border=0&cellpadding=1[1]" at the end
of
"doc.table_border=1&cellpadding=1[0].table_border=0&cellpadding=1[1]"
is used first, then a parent tag identifier
"table_border=1&cellpadding=1[0]" is chosen as a next
matching element. The procedure then proceeds to step
B90a.
[0081] At step B90a, a check is made as to whether the
ascent at step B80a has run out of matching elements,
that is, whether the tag at the highest level has already
been used as a matching element. If no more matching
element exists, the procedure comes to an end.
Otherwise, the procedure goes back to step B40a.
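The loop formed by steps B30a through B90a can be sketched as follows. This is a sketch under stated assumptions rather than the implementation of the embodiment: identifiers are strings of the form "doc.tag[index].tag[index]", attribute values contain no "." character, and the candidate list holds the partial tree identifiers of the updated document. The function names `split_sets` and `identify` are hypothetical.

```python
import re

# One set looks like 'table_border=0&cellpadding=1[1]'.
SET_PATTERN = re.compile(r"^(?P<tag>.*)\[(?P<index>\d+)\]$")


def split_sets(identifier):
    """Split 'doc.a[0].b[1]' into [('a', 0), ('b', 1)], dropping the root 'doc'."""
    sets = []
    for part in identifier.split(".")[1:]:
        match = SET_PATTERN.match(part)
        sets.append((match.group("tag"), int(match.group("index"))))
    return sets


def identify(stored, candidates):
    """Return the single candidate matching the stored identifier, or None."""
    target = split_sets(stored)
    remaining = [(c, split_sets(c)) for c in candidates]
    # depth 1 is the set at the end of the identifier (B30a);
    # each further iteration ascends one level (B80a).
    for depth in range(1, len(target) + 1):
        tag, index = target[-depth]
        # B40a: match on the tag identifier alone.
        remaining = [(c, s) for c, s in remaining
                     if len(s) >= depth and s[-depth][0] == tag]
        if not remaining:
            return None
        if len(remaining) == 1:
            return remaining[0][0]           # B60a
        # B50a: screen the candidates by numerical value index.
        remaining = [(c, s) for c, s in remaining if s[-depth][1] == index]
        if not remaining:
            return None
        if len(remaining) == 1:
            return remaining[0][0]           # B60a
    return None                              # B90a: no more matching elements


stored = "doc.table_border=1&cellpadding=1[0].table_border=0&cellpadding=1[1]"
updated = [
    "doc.div[0]",
    "doc.div[0].table_border=1&cellpadding=1[0]",
    "doc.div[0].table_border=1&cellpadding=1[0].table_border=0&cellpadding=1[0]",
    "doc.div[0].table_border=1&cellpadding=1[0].table_border=0&cellpadding=1[1]",
]
print(identify(stored, updated))
```

In this example a hypothetical "div" tag has been inserted above the selected portion after the update. Because matching starts from the end of the identifier, the selected table is still found even though the full identifier string no longer matches.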
[0082] The methods of the embodiments as de- 
scribed above may be implemented as programs, which 
are installed in a computer that is to be used as an ap- 
paratus for extracting information. Such programs may 
be distributed through networks. 
[0083] These programs may be stored in a hard-disk 
drive or a removable memory medium such as a flexible 
disk, a CD-ROM, or the like that is connected to the com- 
puter used as an information extracting apparatus, and 
may be loaded to the memory at the time of using the 
method of the invention. 

[0084] Fig.20 is a flowchart showing a schematic of 
the present invention that provides a user with a user 
interface for easy selection of a portion of a structured 
document such as an HTML document in a manner that 
is intuitively easy to understand. 
[0085] The present invention is directed to a method
of selecting and extracting a portion of a structured
document such as an HTML document. An end node at an
end of a tree structure is identified that corresponds to
a position indicated by a user on the screen that is
displaying a document (step 1). The user is invited to
select a node among a series of nodes that are obtained by
successively detecting higher nodes from the end node
(step 2). Then, a portion of the structured document
corresponding to the user-selected node is selected (step
3).

[0086] Fig.21 is a block diagram of a schematic user
interface according to the present invention.
[0087] An apparatus for selecting and extracting a
portion of a structured document such as an HTML
document includes a node detecting unit 101 for detecting
an end node of a tree structure that corresponds to a
position indicated by a user on the document-displayed
screen, a selection determining unit 102 for prompting
the user to select a node from a series of nodes that are
obtained by successively detecting higher nodes from
the end node, and a portion selecting unit 103 for
selecting the portion of the structured document that
corresponds to the user-selected node.
[0088] Fig.22 is a block diagram of an apparatus
according to an embodiment of the present invention.
[0089] An apparatus 100 for selecting and extracting
a portion of a structured document includes a display
control unit 110 inclusive of a selected portion marking
unit 111, an input unit 120, a tree-structure generating
unit 130, and a display-portion storing unit 140. A display
apparatus 10 and an input apparatus 20 are connected
to the apparatus 100.
[0090] The display apparatus 10 displays HTML text
and images that are processed by a browser.
[0091] The input apparatus 20 receives information
specified by a user through button operation or the like.
Such button operation includes area enlargement (+),
size reduction (-), clear (clear), and select (select).
[0092] The selected portion marking unit 111 of the
display control unit 110 displays an object that is
selected by a user button operation from objects at various
levels of the tree structure. A portion selected as a
desired portion by the user is stored in the display-portion
storing unit 140 as an HTML text, for example.
[0093] The input unit 120 receives user inputs (inputs
through button operations) from the input apparatus 20,
and passes the input information to the tree-structure
generating unit 130 and the selected portion marking
unit 111.

[0094] The tree-structure generating unit 130 finds an
object located at a position clicked by the user by
selecting the object among objects that constitute the
whole tree structure of the HTML document. The object
found is stored in an array of objects.
[0095] Fig.23 is a flowchart of a method of selecting 
and extracting a portion according to an embodiment of 
the present invention. 

[0096] At step 110, an HTML document to be proc- 
essed is displayed in the browser window of the user 
terminal. At step 120, the user clicks a portion that the 
user wishes to select on the screen. At step 130, the 
tree-structure generating unit 130 extracts an object cor- 
responding to the clicked position from the objects that 
constitute the HTML tree structure. A rectangular area 
corresponding to the extracted object is marked on the 
document displayed on the screen as shown in Fig.24. 
[0097] If the user determines the marked portion as
his/her selection, the marked portion is stored in the
display-portion storing unit 140 as an HTML text (step 160).
Then, marking on the document is removed (step 180).
If the user chooses not to select the marked portion, the
user can enlarge ("+"), reduce ("-"), or clear ("clear") the
marked area by operating the buttons shown on the
screen (step 170). Through these button operations,
objects belonging to upper levels or lower levels of the tree
structure are successively displayed. When a desired
portion is marked on the screen, the marked portion is
selected at step 150 by the select button ("select"). The
selected portion is stored in the display-portion storing
unit 140 as an HTML text (step 160).
[0098] The procedure described above may be performed
by a browser. In such a case, the HTML to be
processed is provided with additional scripts written in
JavaScript, and is fed into the browser.
[0099] In the following, the operation of the tree-struc- 
ture generating unit 130 will be described. 

[0100] Fig.25 is a flowchart showing an operation of
the tree-structure generating unit according to an
embodiment of the present invention.
[0101] An array is initialized (step 131). An object
located at a clicked position is detected (step 132), and is
stored in the array (step 133). Here, objects are part of
the HTML document, and correspond to respective
nodes of a tree structure. On the screen, there are areas
that belong to respective objects. In the example of Fig.
29, a tree structure is comprised of 13 nodes in total. If
the detected object has a parent object (YES at step
134), this parent object is also stored in the array as an
object belonging to the same clicked position (step 133).
This process is carried out with respect to all the object
layers, generating an object array a corresponding to
the clicked position. A click on "apple" in Fig.29 will result
in objects "k, j, i, h, f, e, d, and a" being stored in the
array a.
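The generation of the array a (steps 131 through 134) amounts to walking parent links from the clicked object up to the root. A minimal sketch follows; the `DomObject` class is a hypothetical stand-in for a browser object, and only the chain of ancestors of "k" is modeled, not all 13 nodes of Fig.29.

```python
class DomObject:
    """Hypothetical stand-in for an object of the HTML tree structure."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def ancestor_array(clicked):
    """Return the clicked object followed by all of its ancestors."""
    array_a = []                 # step 131: initialize the array
    obj = clicked                # step 132: object at the clicked position
    while obj is not None:
        array_a.append(obj)      # step 133: store the object
        obj = obj.parent         # step 134: continue while a parent exists
    return array_a


# Build the chain a > d > e > f > h > i > j > k from the root downward.
node = None
for name in ["a", "d", "e", "f", "h", "i", "j", "k"]:
    node = DomObject(name, parent=node)

names = [obj.name for obj in ancestor_array(node)]
print(names)
```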

[0102] Each element of this array is checked (step
135). This is intended to select an object of the highest
level among objects that cannot be distinguished from
each other from their appearance on the screen. Such
cases occur when texts and images belonging to an
object as well as texts and areas corresponding to the
object are identical to those of other objects.
[0103] If there is a next element, a check is made as
to whether a text belonging to the next element is
different (step 137). If it is different, the object is stored
in an array b (step 138). Then, a check is made again as to
whether there is a next element (step 136). If no next
element exists, the object is stored in the array b (step
139). With this, the procedure comes to an end.
[0104] In this manner, the array b of objects is
obtained where these objects correspond to the clicked
position and are distinguishable from each other on the
screen.

[0105] In the example of Fig.29, "k" and "j" have the
same text "apple" belonging to them, and "j" that is at
the higher level is stored in the array b. "i" and "h" have
the same texts "apple" and "orange" belonging to them,
and "h" that is at the higher level is stored in the array
b. In this example, "j", "h", and "a" will be stored in the
array b.
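The filtering of steps 135 through 139 keeps, for each run of objects that look the same on the screen, only the object at the highest level. The sketch below compares text only, whereas the specification also mentions images and areas. The texts assigned to "f", "e", "d", and "a" are assumptions chosen for illustration, since Fig.29 is not reproduced here.

```python
def distinguishable(array_a):
    """Filter (name, text) pairs, ordered from the clicked object up to the root.

    An object is kept when the next higher object has a different text
    (steps 137 and 138) or when there is no next object (step 139).
    """
    array_b = []
    for position, (name, text) in enumerate(array_a):
        is_last = position == len(array_a) - 1
        if is_last or array_a[position + 1][1] != text:
            array_b.append(name)
    return array_b


# Texts for "f", "e", "d", and "a" are hypothetical.
array_a = [
    ("k", "apple"),
    ("j", "apple"),
    ("i", "apple orange"),
    ("h", "apple orange"),
    ("f", "apple orange banana"),
    ("e", "apple orange banana"),
    ("d", "apple orange banana"),
    ("a", "apple orange banana"),
]
array_b = distinguishable(array_a)
print(array_b)
```

With these assumed texts, each press of the enlargement button visibly changes the marked area, because objects that would mark the same content are collapsed into one entry.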

[0106] In what follows, the selected portion marking 
unit 111 will be described. 
[0107] Fig.26 is a flowchart of an operation of the
selected portion marking unit 111 according to an
embodiment of the present invention.
[0108] An object to be displayed is selected (step
141). In the case of initial presentation, a current
element of the array b will be selected. In the case of
enlarged presentation, an element of the array b next
higher than the current element will be selected. In the case
of reduced-size presentation, an element of the array b
next lower than the current element will be selected.
[0109] At the initial presentation, a rectangular shape
is extracted that corresponds to the lowest-level object
of the object array b (step 142). The extracted rectangle
is superimposed on the screen as shown in Fig.24 (step
143). Among the buttons shown in Fig.24, the
enlargement button "+" will select an object next higher than
the object corresponding to the currently selected area,
resulting in the rectangle of the newly selected object
being superimposed on the screen. By the same token,
the size-reduction button "-" will select the next lower
object. With regard to the example of Fig.29, a rectangle
area is superimposed on the display with respect to a
corresponding object selected from the array b as
shown in Fig.27.
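The selection logic of step 141 can be sketched as a cursor moving over the array b. The class name `MarkedSelection` is hypothetical, and clamping at the lowest and highest objects is an assumption, since the specification does not state what happens when "+" is pressed at the top level or "-" at the bottom level.

```python
class MarkedSelection:
    """Track which element of array b is currently marked on the screen."""

    def __init__(self, array_b):
        self.array_b = array_b   # ordered from lowest-level to highest-level
        self.current = 0         # initial presentation: lowest-level object

    def enlarge(self):
        # "+" button: move to the next higher object (clamped, by assumption).
        self.current = min(self.current + 1, len(self.array_b) - 1)
        return self.marked()

    def reduce(self):
        # "-" button: move to the next lower object (clamped, by assumption).
        self.current = max(self.current - 1, 0)
        return self.marked()

    def marked(self):
        return self.array_b[self.current]


selection = MarkedSelection(["j", "h", "a"])
print(selection.marked())
print(selection.enlarge())
print(selection.reduce())
```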

[0110] In the following, a system construction in its
entirety will be described.
[0111] Fig.28 is an illustrative drawing showing a
construction of the system according to an embodiment of
the present invention.
[0112] The system shown in Fig.28 includes a user
client terminal 100 (apparatus for selecting and
extracting a portion from a structured document), a relay
server 200, and a server 300 which stores an HTML document
subjected to processing.

[0113] In respect of the HTML document subjected to
processing, the relay server 200
"http://www.myserv.com/cgi-bin/get.cgi?http://www.foo.com/doc.html"
is provided for the purpose of allowing the operations as
described above to be performed on the same screen
that shows "http://www.foo.com/doc.html".
[0114] In the following description, numbers bracketed
in "()" correspond to the respective numbers marked in
Fig.28.

(1) From the client terminal 100, the user starts the 
CGI of the relay server 200 with reference to the 
URL of the HTML document subjected to process- 
ing. 

(2) The relay server 200 sends a request to the serv- 
er 300 by using the URL. 

(3) The server 300 transmits the HTML document 
to the relay server 200. 

(4) The relay server 200 adds a job script to the end 
of the HTML document obtained from the server 
300. 

(5) The relay server 200 transmits the HTML docu- 
ment to the client terminal 100 where the HTML 
document has an attached function for selecting 
and extracting a document portion. 

[0115] In this manner, the client terminal 100 can
process the HTML document with the attached function
of selecting and extracting a document portion.
[0116] Components of the apparatus for selecting and
extracting a structured-document portion as described
in these embodiments may be implemented as programs,
which are installed in a computer that is to be
used as an apparatus for selecting and extracting a
structured-document portion. Such programs may be
distributed through networks.
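The role of the relay server 200 in steps (2) through (5) can be sketched as a function that obtains the document and appends a script to its end. This is only an illustration: the `fetch` parameter is a hypothetical stand-in for the HTTP request to the server 300, and the script file name "select.js" is an assumption, not a name given in the specification.

```python
def relay(url, fetch, script_url="select.js"):
    """Return the document at `url` with a selection script appended."""
    html = fetch(url)                                   # steps (2) and (3)
    script_tag = f'<script src="{script_url}"></script>'
    return html + "\n" + script_tag + "\n"              # step (4)


def fake_fetch(url):
    # Stand-in for the server 300; a real relay would issue an HTTP request.
    return "<html><body><p>apple</p></body></html>"


# Step (5): this is what the client terminal 100 would receive.
print(relay("http://www.foo.com/doc.html", fake_fetch))
```

Passing the fetch operation in as a parameter keeps the sketch runnable without network access. Browsers generally tolerate a script element placed after the closing html tag, although inserting it before the closing body tag would be the more conventional choice.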

[0117] These programs may be stored in a hard-disk 
drive or a removable memory medium such as a flexible 
disk, a CD-ROM, or the like that is connected to the com- 
puter used as the apparatus for selecting and extracting 
a structured-document portion, and may be loaded to 
the memory at the time of using the method of the in- 
vention. 

[0118] Further, the present invention is not limited to 
these embodiments, but various variations and modifi- 
cations may be made without departing from the scope 
of the present invention. 

[0119] The present application is based on Japanese
priority application No. 2002-190621 filed on June 28, 
2002, and Japanese priority application No. 
2002-204641 filed on July 12, 2002, with the Japanese 
Patent Office, the entire contents of which are hereby 
incorporated by reference. 



Claims 

1. A method of extracting information from a structured
document, wherein the structured document is converted
into a tree structure in order to identify a selected
portion in the structured document after updating thereof,
the selected portion being selected in advance from the
structured document, and the selected portion
corresponding to a selected partial tree, comprising the
steps of:

assigning a partial tree identifier inclusive of a
tag identifier to the selected partial tree wherein
the tag identifier includes a name of a tag
corresponding to a root of said selected partial
tree, a name of at least one format attribute of
the tag, and a value of said at least one format
attribute;

arranging names of format attributes in a
predetermined order in the tag identifier if said at
least one format attribute of the tag includes
two or more format attributes; and

identifying a partial tree having a partial tree
identifier identical to the partial tree identifier of
the selected partial tree from a list of partial tree
identifiers of partial trees that exist in the
structured document after updating thereof.

2. The method as claimed in claim 1, wherein a
plurality of tags in the structured document have the
name of the tag and the name of said at least one
format attribute identical to those of the selected
partial tree, said method further comprising the
steps of:

generating numerical value indexes that indicate
respective sequential numbers of tag identifiers
in the same hierarchical level of the tree
structure; and

combining a tag identifier and a numerical value
index into a set, and connecting a plurality of
sets of a tag identifier and a numerical value
index in series from a root of the tree structure
to the root of the selected partial tree, thereby
producing the partial tree identifier.

3. The method as claimed in claim 2, wherein said step
of identifying a partial tree includes the steps of:

matching the partial tree identifiers of the partial
trees with the partial tree identifier of the
selected partial tree by referring only to the tag
identifier located at an end of the partial tree
identifier;

screening candidates by referring to the numerical
value indexes of the partial tree identifiers
if two or more candidates of partial tree
identifiers remain after said step of matching;

recursively matching the partial tree identifiers
of the partial trees with the partial tree identifier
of the selected partial tree by successively
ascending to a next higher tag for use in the
matching if two or more candidates of partial
tree identifiers remain after said step of
screening; and

identifying, as the selected partial tree, a partial
tree that remains alone after said step of
recursively matching the partial tree identifiers.

4. The method as claimed in claim 1, further comprising
the steps of:

detecting an end node of the tree structure that
corresponds to a position indicated by a user
on a screen that displays the structured
document;

prompting the user to select a node from a
series of nodes that are obtained by successively
detecting next higher nodes from the end node;

selecting, as said selected portion, a portion of
the structured document that corresponds to
the node selected by the user.

5. A program for causing a computer to extract
information from a structured document, wherein the
structured document is converted into a tree structure
in order to identify a selected portion in the
structured document after updating thereof, the
selected portion being selected in advance from the
structured document, and the selected portion
corresponding to a selected partial tree, said program
comprising the steps of:

assigning a partial tree identifier inclusive of a 
tag identifier to the selected partial tree wherein 
the tag identifier includes a name of a tag cor- 
responding to a root of said selected partial 
tree, a name of at least one format attribute of 
the tag, and a value of said at least one format 
attribute; 

arranging names of format attributes in a pre- 
determined order in the tag identifier if said at 
least one format attribute of the tag includes 
two or more format attributes; and 
identifying a partial tree having a partial tree 
identifier identical to the partial tree identifier of 
the selected partial tree from a list of partial tree 
identifiers of partial trees that exist in the struc- 
tured document after updating thereof. 

6. The program as claimed in claim 5, wherein a plu- 
rality of tags in the structured document have the 
name of the tag and the name of said at least one 
format attribute identical to those of the selected 
partial tree, said program further comprising the 
steps of: 

generating numerical value indexes that indi- 
cate respective sequential numbers of tag iden- 
tifiers in the same hierarchical level of the tree 
structure; and 

combining a tag identifier and a numerical value 
index into a set, and connecting a plurality of 
sets of a tag identifier and a numerical value 
index in series from a root of the tree structure 
to the root of the selected partial tree, thereby 
producing the partial tree identifier. 

7. The program as claimed in claim 6, wherein said 
step of identifying a partial tree includes the steps 
of: 

matching the partial tree identifiers of the partial 
trees with the partial tree identifier of the select- 
ed partial tree by referring only to the tag iden- 
tifier located at an end of the partial tree identi- 
fier; 

screening candidates by referring to the numer- 
ical value indexes of the partial tree identifiers 
if two or more candidates of partial tree identi- 
fiers remain after said step of matching; 
recursively matching the partial tree identifiers 
of the partial trees with the partial tree identifier 
of the selected partial tree by successively as- 
cending to a next higher tag for use in the 
matching if two or more candidates of partial 
tree identifiers remain after said step of screening; and

identifying, as the selected partial tree, a partial 
tree that remains alone after said step of recur- 
sively matching the partial tree identifiers. 

8. The program as claimed in claim 5, further comprising
the steps of:

detecting an end node of the tree structure that
corresponds to a position indicated by a user
on a screen that displays the structured
document;

prompting the user to select a node from a
series of nodes that are obtained by successively
detecting next higher nodes from the end node;

selecting, as said selected portion, a portion of
the structured document that corresponds to
the node selected by the user.

9. A computer readable medium having a program
embodied therein for causing a computer to extract
information from a structured document, wherein
the structured document is converted into a tree
structure in order to identify a selected portion in the
structured document after updating thereof, the
selected portion being selected in advance from the
structured document, and the selected portion
corresponding to a selected partial tree, said program
comprising the steps of:

assigning a partial tree identifier inclusive of a
tag identifier to the selected partial tree wherein
the tag identifier includes a name of a tag
corresponding to a root of said selected partial
tree, a name of at least one format attribute of
the tag, and a value of said at least one format
attribute;

arranging names of format attributes in a
predetermined order in the tag identifier if said at
least one format attribute of the tag includes
two or more format attributes; and

identifying a partial tree having a partial tree
identifier identical to the partial tree identifier of
the selected partial tree from a list of partial tree
identifiers of partial trees that exist in the
structured document after updating thereof.

10. The computer readable medium as claimed in claim
9, wherein a plurality of tags in the structured
document have the name of the tag and the name of
said at least one format attribute identical to those
of the selected partial tree, said program further
comprising the steps of:

generating numerical value indexes that indicate
respective sequential numbers of tag identifiers
in the same hierarchical level of the tree
structure; and

combining a tag identifier and a numerical value
index into a set, and connecting a plurality of
sets of a tag identifier and a numerical value
index in series from a root of the tree structure
to the root of the selected partial tree, thereby
producing the partial tree identifier.

11. The computer readable medium as claimed in claim
10, wherein said step of identifying a partial tree
includes the steps of:

matching the partial tree identifiers of the partial 
trees with the partial tree identifier of the select- 
ed partial tree by referring only to the tag iden- 
tifier located at an end of the partial tree identi- 
fier; 

screening candidates by referring to the numer- 
ical value indexes of the partial tree identifiers 
if two or more candidates of partial tree identi- 
fiers remain after said step of matching; 
recursively matching the partial tree identifiers 
of the partial trees with the partial tree identifier 
of the selected partial tree by successively as- 
cending to a next higher tag for use in the 
matching if two or more candidates of partial 
tree identifiers remain after said step of screen- 
ing; and 

identifying, as the selected partial tree, a partial 
tree that remains alone after said step of recur- 
sively matching the partial tree identifiers. 

12. The computer readable medium as claimed in claim
9, further comprising the steps of:

detecting an end node of the tree structure that 
corresponds to a position indicated by a user 
on a screen that displays the structured docu- 
ment; 

prompting the user to select a node from a se- 
ries of nodes that are obtained by successively 
detecting next higher nodes from the end node; and

selecting, as said selected portion, a portion of
the structured document that corresponds to
the node selected by the user.

13. An apparatus for extracting information from a 
structured document, comprising: 

a tree-structure conversion unit which converts 
the structured document into a tree structure; 
and 

a partial-tree-identifier generating unit which 
assigns a partial tree identifier inclusive of a tag 
identifier to a partial tree of the tree structure 
wherein the tag identifier includes a name of a 
tag corresponding to a root of said selected par- 
tial tree, a name of at least one format attribute 
of the tag, and a value of said at least one for-
mat attribute.

14. The apparatus as claimed in claim 13, wherein said 
partial-tree-identifier generating unit arranges 
names of format attributes in a predetermined order
in the tag identifier if said at least one format at- 
tribute of the tag includes two or more format at- 
tributes. 
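
As an illustration of the ordering recited in claim 14, the sketch below builds a tag identifier from a tag name and its format attributes. The alphabetical order, the lower-casing of names, and the `name[attr=value,...]` layout are assumptions chosen for the example; the claim requires only that some predetermined order be used.

```python
def tag_identifier(name, attrs):
    """Build a normalized tag identifier.

    Assumption: the predetermined order is alphabetical by attribute name,
    so the same tag yields the same identifier however its attributes
    happen to be written in the source document.
    """
    parts = [f"{key.lower()}={attrs[key]}"
             for key in sorted(attrs, key=str.lower)]
    return f"{name.lower()}[{','.join(parts)}]" if parts else name.lower()


print(tag_identifier("TABLE", {"width": "100", "border": "1"}))
# table[border=1,width=100]
print(tag_identifier("table", {"border": "1", "width": "100"}))
# table[border=1,width=100]
```

Both calls yield the same identifier, which is what allows an identifier recorded before an update to be compared directly with one generated afterwards.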

15. The apparatus as claimed in claim 13, wherein a
plurality of tags in the structured document have the 
name of the tag and the name of said at least one 
format attribute identical to those of the selected 
partial tree, and wherein said partial-tree-identifier 
generating unit generates numerical value indexes 
that indicate respective sequential numbers of tag 
identifiers in the same hierarchical level of the tree 
structure, and combines a tag identifier and a nu-
merical value index into a set, followed by connect-
ing a plurality of sets of a tag identifier and a numer-
ical value index in series from a root of the tree
structure to the root of the selected partial tree, 
thereby producing the partial tree identifier. 

16. The apparatus as claimed in claim 13, further com-
prising:

a node detecting unit which detects an end node
of the tree structure that corresponds to a po-
sition indicated by a user on a screen that dis-
plays the structured document;
a selection determining unit which prompts the 
user to select a node from a series of nodes 
that are obtained by successively detecting 
next higher nodes from the end node; and
a portion selecting unit which selects, as said 
selected portion, a portion of the structured 
document that corresponds to the node select- 
ed by the user. 

17. A method of selecting and extracting a portion of a 
structured document, comprising the steps of: 

detecting an end node of the tree structure that 
corresponds to a position indicated by a user
on a screen that displays the structured docu- 
ment; 

prompting the user to select a node from a se- 
ries of nodes that are obtained by successively 
detecting next higher nodes from the end node;
and 

selecting and extracting a portion of the struc- 
tured document that corresponds to the node 
selected by the user. 
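
The detecting and prompting steps of claim 17 amount to walking from an end node up to the root and letting the user pick one node from that series. A minimal sketch follows; the `parent_of` mapping, the node labels, and the function names are hypothetical stand-ins for whatever tree representation an implementation actually uses.

```python
def ancestor_series(end_node, parent_of):
    """Series of nodes obtained by successively detecting next higher nodes."""
    series = [end_node]
    while series[-1] in parent_of:
        series.append(parent_of[series[-1]])
    return series


def select_portion(series, choice):
    """Return the node the user picked; choice is a position in the series."""
    return series[choice]


# Hypothetical child -> parent links for one branch of a document tree.
parent_of = {"#text": "td", "td": "tr", "tr": "table",
             "table": "body", "body": "html"}

series = ancestor_series("#text", parent_of)
print(series)                      # ['#text', 'td', 'tr', 'table', 'body', 'html']
print(select_portion(series, 3))   # table
```

Here the position indicated on the screen resolves to the text node, and the user widens the selection three levels to take the whole table.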

18. The method as claimed in claim 17, wherein said 
step of prompting the user includes the steps of: 



marking on the screen an area of a portion of 
the structured document that corresponds to 
one of said nodes; 

prompting the user to select a desired area by 
changing node selections; and 
determining a node corresponding to the se- 
lected desired area as a user selected node, 

wherein said step of selecting and extracting 
a portion of the structured document selects a por- 
tion of the structured document that corresponds to 
said user selected node. 

19. The method as claimed in claim 18, wherein said 
step of determining a node includes a step of se- 
lecting, as said user selected node, a node of a 
highest level from a plurality of nodes if said plurality 
of nodes correspond to said selected desired area. 

20. The method as claimed in claim 18, wherein said 
step of determining a node includes a step of se- 
lecting, as said user selected node, a node of a 
highest level from a plurality of nodes if said plurality 
of nodes include the same text and image data be- 
longing thereto. 
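
The rule of claim 20 can be sketched as follows: when several nodes in the series carry the same content, the highest of them is taken as the user selected node. The sketch compares text only; image data is left out for brevity. The dictionary node layout and the function names are assumptions made for illustration.

```python
def text_of(node):
    """Concatenate the text data belonging to a node, in document order."""
    if isinstance(node, str):
        return node.strip()
    pieces = (text_of(child) for child in node["children"])
    return " ".join(p for p in pieces if p)


def highest_equivalent(series, chosen):
    """Move up from series[chosen] while the next higher node has the same text.

    series is ordered from the end node up toward the root.
    """
    content = text_of(series[chosen])
    while chosen + 1 < len(series) and text_of(series[chosen + 1]) == content:
        chosen += 1
    return chosen


text = "orange"
td = {"tag": "td", "children": [text]}
tr = {"tag": "tr", "children": [td]}
apple_row = {"tag": "tr", "children": [{"tag": "td", "children": ["apple"]}]}
table = {"tag": "table", "children": [apple_row, tr]}

series = [text, td, tr, table]
print(series[highest_equivalent(series, 0)]["tag"])  # tr
```

The text node, its `td`, and its `tr` all contain only "orange", so the `tr` is selected; the `table` also contains "apple" and is therefore not equivalent.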

21. The method as claimed in claim 17, further compris-
ing the steps of:

transferring the structured document indicated 
by a user-specified URL to a relay server; 
attaching a script inclusive of a function to se- 
lect and extract a document portion to the struc- 
tured document at said relay server; and 
transferring the structured document having 
the attached script from said relay server to a 
user terminal, 

wherein said steps of detecting, prompting, 
and selecting are performed by use of the attached 
script at said user terminal. 
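
The attaching step of claim 21 can be pictured as the relay server inserting a script reference into the fetched document before forwarding it to the user terminal. The sketch below shows only that string manipulation; the function name, the script path `/select.js`, and the choice to insert just before the closing body tag are assumptions, and fetching and forwarding the document are omitted.

```python
import re

# Matches a closing body tag regardless of letter case.
_BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)


def attach_script(html, script_src):
    """Return html with a script element inserted before the last </body>.

    If the document has no closing body tag, the element is appended.
    """
    tag = f'<script src="{script_src}"></script>'
    matches = list(_BODY_CLOSE.finditer(html))
    if not matches:
        return html + tag
    pos = matches[-1].start()
    return html[:pos] + tag + html[pos:]


page = "<html><body><p>sample</p></BODY></html>"
print(attach_script(page, "/select.js"))
# <html><body><p>sample</p><script src="/select.js"></script></BODY></html>
```

Placing the script inside the served document is what lets the detecting, prompting, and selecting steps run at the user terminal without any software installed there in advance.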

22. An apparatus for selecting and extracting a portion 
of a structured document, comprising: 

a node detecting unit which detects an end 
node of the tree structure that corresponds to 
a position indicated by a user on a screen that 
displays the structured document; 
a selection determining unit which prompts the 
user to select a node from a series of nodes 
that are obtained by successively detecting 
next higher nodes from the end node; and 
a portion selecting unit which selects and ex- 
tracts a portion of the structured document that 
corresponds to the node selected by the user. 

23. The apparatus as claimed in claim 22, wherein said
selection determining unit marks on the screen an
area of a portion of the structured document that
corresponds to one of said nodes, and prompts the
user to select a desired area by changing node se-
lections, followed by determining a node corre-
sponding to the selected desired area as a user se-
lected node, wherein said portion selecting unit se-
lects a portion of the structured document that cor-
responds to said user selected node.

24. The apparatus as claimed in claim 23, wherein said 
determining unit selects, as said user selected 
node, a node of a highest level from a plurality of 
nodes if said plurality of nodes correspond to said 
selected desired area.

25. The apparatus as claimed in claim 23, wherein said 
determining unit selects, as said user selected 
node, a node of a highest level from a plurality of 
nodes if said plurality of nodes include the same text
and image data belonging thereto. 

26. The apparatus as claimed in claim 22, further com- 
prising: 

a unit which transfers the structured document 
indicated by a user-specified URL to a relay 
server; 

a unit which attaches a script inclusive of a 
function to select and extract a document por-
tion to the structured document at said relay
server; and

a unit which transfers the structured document 
having the attached script from said relay serv-
er to a user terminal,

wherein said node detecting unit, said selec- 
tion determining unit, and said portion selecting unit 
operate by use of the attached script at said user 
terminal.

27. A program for selecting and extracting a portion of 
a structured document, comprising the steps of: 

detecting an end node of the tree structure that
corresponds to a position indicated by a user 
on a screen that displays the structured docu- 
ment; 

prompting the user to select a node from a se-
ries of nodes that are obtained by successively
detecting next higher nodes from the end node; 
and 

selecting and extracting a portion of the struc- 
tured document that corresponds to the node 
selected by the user.

28. The program as claimed in claim 27, wherein said 
step of prompting the user includes the steps of: 



marking on the screen an area of a portion of 
the structured document that corresponds to 
one of said nodes; 

prompting the user to select a desired area by 
changing node selections; and 
determining a node corresponding to the se- 
lected desired area as a user selected node, 

wherein said step of selecting and extracting 
a portion of the structured document selects a por- 
tion of the structured document that corresponds to 
said user selected node. 

29. The program as claimed in claim 28, wherein said 
step of determining a node includes a step of se- 
lecting, as said user selected node, a node of a 
highest level from a plurality of nodes if said plurality 
of nodes correspond to said selected desired area. 

30. The program as claimed in claim 28, wherein said 
step of determining a node includes a step of se- 
lecting, as said user selected node, a node of a 
highest level from a plurality of nodes if said plurality 
of nodes include the same text and image data be- 
longing thereto. 

31. The program as claimed in claim 27, further com- 
prising the steps of: 

transferring the structured document indicated 
by a user-specified URL to a relay server; 
attaching a script inclusive of a function to se- 
lect and extract a document portion to the struc- 
tured document at said relay server; and 
transferring the structured document having 
the attached script from said relay server to a 
user terminal, 

wherein said steps of detecting, prompting, 
and selecting are performed by use of the attached 
script at said user terminal. 

32. A computer readable medium having a program 
embodied therein for causing a computer to select 
and extract a portion of a structured document, said 
program comprising the steps of: 

detecting an end node of the tree structure that 
corresponds to a position indicated by a user 
on a screen that displays the structured docu- 
ment; 

prompting the user to select a node from a se- 
ries of nodes that are obtained by successively 
detecting next higher nodes from the end node; 
and 

selecting and extracting a portion of the struc- 
tured document that corresponds to the node 
selected by the user.

33. The computer readable medium as claimed in claim
32, wherein said step of prompting the user includes
the steps of: 

marking on the screen an area of a portion of
the structured document that corresponds to 
one of said nodes; 

prompting the user to select a desired area by 
changing node selections; and 
determining a node corresponding to the se-
lected desired area as a user selected node,

wherein said step of selecting and extracting 
a portion of the structured document selects a por-
tion of the structured document that corresponds to
said user selected node. 

34. The computer readable medium as claimed in claim
33, wherein said step of determining a node in-
cludes a step of selecting, as said user selected
node, a node of a highest level from a plurality of 
nodes if said plurality of nodes correspond to said 
selected desired area. 

35. The computer readable medium as claimed in claim
33, wherein said step of determining a node in- 
cludes a step of selecting, as said user selected 
node, a node of a highest level from a plurality of 
nodes if said plurality of nodes include the same text 
and image data belonging thereto.

36. The computer readable medium as claimed in claim
32, said program further comprising the steps of: 

transferring the structured document indicated
by a user-specified URL to a relay server; 
attaching a script inclusive of a function to se-
lect and extract a document portion to the struc-
tured document at said relay server; and
transferring the structured document having
the attached script from said relay server to a 
user terminal, 

wherein said steps of detecting, prompting, 
and selecting are performed by use of the attached
script at said user terminal. 

FIG.4

( START )

S1: USE TAG IDENTIFIER AS IDENTIFIER
OF CORRESPONDING PARTIAL TREE

S2: NORMALIZE TAG IDENTIFIER
IF THERE ARE TWO OR MORE FORMAT
ATTRIBUTES IN TAG IDENTIFIER

S3: SELECT PARTIAL TREE AS SELECTED
PORTION FROM LIST OF IDENTIFIERS
OF PARTIAL TREES IF PARTIAL TREE
HAS THE SAME IDENTIFIER AS
ALREADY SELECTED PARTIAL TREE

( STOP )

FIG.10

FIG.15

FIG.19

FIG.20

FIG.26

FIG.29



HTML SOURCE 



BROWSER PRESENTATION 



<html>
<head>
<title>
sample
</title>
</head>
<body>
<table border=1>
<tr>
<td>
fruits
</td>
<td>
<table border=1>
<tr>
<td>
apple
</td>
</tr>
<tr>
<td>
orange
</td>
</tr>
</table>
</td>
</tr>
</table>
</body>
</html>